empty space
On the Limits of Innate Planning in Large Language Models
Schepanowski, Charles, Ling, Charles
Large language models (LLMs) achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code execution or other tools, using the 8-puzzle: a classic task that requires state tracking and goal-directed planning while allowing precise, step-by-step evaluation. Four models are tested under common prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) and with tiered corrective feedback. Feedback improves success rates for some model-prompt combinations, but many successful runs are long, computationally expensive, and indirect. We then examine the models with an external move validator that provides only valid moves. Despite this level of assistance, none of the models solve any puzzles in this setting. Qualitative analysis reveals two dominant deficits across all models: (1) brittle internal state representations, leading to frequent invalid moves, and (2) weak heuristic planning, with models entering loops or selecting actions that do not reduce the distance to the goal state. These findings indicate that, in the absence of external tools such as code interpreters, current LLMs have substantial limitations in planning and that further progress may require mechanisms for maintaining explicit state and performing structured search.
- North America > United States > Virginia (0.04)
- Europe > France (0.04)
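The two deficits named in the abstract above (invalid moves and actions that do not reduce the distance to the goal) can be made concrete with a small sketch. This is an illustration only, not the authors' evaluation code: the tuple encoding of the board, the goal layout, the helper names `valid_moves`, `apply_move`, and `manhattan`, and the choice of Manhattan distance as the distance measure are all assumptions.

```python
from typing import List, Tuple

# Assumed encoding: 9 entries in row-major order, 0 marks the blank.
State = Tuple[int, ...]
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout

# Direction in which the blank moves, as (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def valid_moves(state: State) -> List[str]:
    """Return the moves that keep the blank inside the 3x3 board."""
    row, col = divmod(state.index(0), 3)
    return [
        name
        for name, (dr, dc) in MOVES.items()
        if 0 <= row + dr < 3 and 0 <= col + dc < 3
    ]


def apply_move(state: State, move: str) -> State:
    """Slide the blank in the given direction, rejecting invalid moves."""
    if move not in valid_moves(state):
        raise ValueError(f"invalid move {move!r} for state {state}")
    blank = state.index(0)
    row, col = divmod(blank, 3)
    dr, dc = MOVES[move]
    target = (row + dr) * 3 + (col + dc)
    tiles = list(state)
    tiles[blank], tiles[target] = tiles[target], tiles[blank]
    return tuple(tiles)


def manhattan(state: State, goal: State = GOAL) -> int:
    """Sum of per-tile Manhattan distances to the goal (blank excluded)."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal_idx = goal.index(tile)
        total += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return total
```

A validator of this kind answers "which moves are legal here?", while comparing `manhattan` before and after a move is one way to check whether an action brings the board closer to the goal.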
Efficient On-Policy Reinforcement Learning via Exploration of Sparse Parameter Space
Zhang, Xinyu, Deb, Aishik, Mueller, Klaus
Policy-gradient methods such as Proximal Policy Optimization (PPO) are typically updated along a single stochastic gradient direction, leaving the rich local structure of the parameter space unexplored. Previous work has shown that the surrogate gradient is often poorly correlated with the true reward landscape. Building on this insight, we visualize the parameter space spanned by policy checkpoints within an iteration and reveal that higher performing solutions often lie in nearby unexplored regions. To exploit this opportunity, we introduce ExploRLer, a pluggable pipeline that seamlessly integrates with on-policy algorithms such as PPO and TRPO, systematically probing the unexplored neighborhoods of surrogate on-policy gradient updates. Without increasing the number of gradient updates, ExploRLer achieves significant improvements over baselines in complex continuous control environments. Our results demonstrate that iteration-level exploration provides a practical and effective way to strengthen on-policy reinforcement learning and offer a fresh perspective on the limitations of the surrogate objective.
- North America > United States > New York > Suffolk County > Stony Brook (0.05)
- Europe > France (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)
Ludax: A GPU-Accelerated Domain Specific Language for Board Games
Todd, Graham, Padula, Alexander G., Soemers, Dennis J. N. J., Togelius, Julian
Games have long been used as benchmarks and testing environments for research in artificial intelligence. A key step in supporting this research was the development of game description languages: frameworks that compile domain-specific code into playable and simulatable game environments, allowing researchers to generalize their algorithms and approaches across multiple games without having to manually implement each one. More recently, progress in reinforcement learning (RL) has been largely driven by advances in hardware acceleration. Libraries like JAX allow practitioners to take full advantage of cutting-edge computing hardware, often speeding up training and testing by orders of magnitude. Here, we present a synthesis of these strands of research: a domain-specific language for board games which automatically compiles into hardware-accelerated code. Our framework, Ludax, combines the generality of game description languages with the speed of modern parallel processing hardware and is designed to fit neatly into existing deep learning pipelines. We envision Ludax as a tool to help accelerate games research generally, from RL to cognitive science, by enabling rapid simulation and providing a flexible representation scheme. We present a detailed breakdown of Ludax's description language and technical notes on the compilation process, along with speed benchmarking and a demonstration of training RL agents. The Ludax framework, along with implementations of existing board games, is open-source and freely available.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Austria > Vienna (0.14)
- Oceania > Australia > Queensland (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs
Dong, Zeyu, Zhu, Yimin, Li, Yansong, Mahon, Kevin, Sun, Yu
Traditional autonomous driving methods adopt a modular design, decomposing tasks into sub-tasks. In contrast, end-to-end autonomous driving directly outputs actions from raw sensor data, avoiding error accumulation. However, training an end-to-end model requires a comprehensive dataset; otherwise, the model exhibits poor generalization capabilities. Recently, large language models (LLMs) have been applied to enhance the generalization capabilities of end-to-end driving models. Most studies explore LLMs in an open-loop manner, where the output actions are compared to those of experts without direct feedback from the real world, while others examine closed-loop results only in simulations. This paper proposes an efficient architecture that integrates multimodal LLMs into end-to-end driving models operating in closed-loop settings in real-world environments. In our architecture, the LLM periodically processes raw sensor data to generate high-level driving instructions, effectively guiding the end-to-end model, even at a slower rate than the raw sensor data. This architecture relaxes the trade-off between the latency and inference quality of the LLM. It also allows us to choose from a wide variety of LLMs to improve high-level driving instructions and minimize fine-tuning costs. Consequently, our architecture reduces data collection requirements because the LLMs do not directly output actions; we only need to train a simple imitation learning model to output actions. In our experiments, the training data for the end-to-end model in a real-world environment consists of only simple obstacle configurations with one traffic cone, while the test environment is more complex and contains multiple obstacles placed in various positions. Experiments show that the proposed architecture enhances the generalization capabilities of the end-to-end model even without fine-tuning the LLM.
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > New York > Suffolk County > Stony Brook (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Transportation > Ground > Road (1.00)
- Information Technology > Robotics & Automation (1.00)
- Automobiles & Trucks (1.00)
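The two-rate arrangement described in the abstract above, where a slow LLM refreshes a high-level instruction only periodically while a fast model acts on every sensor frame, might be sketched as follows. This is a schematic under stated assumptions, not the paper's architecture: the function name `run_two_rate_loop`, the callables `instruct` and `act`, and the fixed refresh `period` are hypothetical.

```python
from typing import Any, Callable, Iterable, List


def run_two_rate_loop(
    frames: Iterable[Any],
    instruct: Callable[[Any], Any],   # slow path, stands in for the LLM
    act: Callable[[Any, Any], Any],   # fast path, stands in for the driving model
    period: int,
) -> List[Any]:
    """Act on every frame, refreshing the instruction every `period` frames."""
    if period < 1:
        raise ValueError("period must be at least 1")
    actions: List[Any] = []
    instruction = None
    for step, frame in enumerate(frames):
        # The slow component only runs on every `period`-th frame.
        if step % period == 0:
            instruction = instruct(frame)
        # The fast component runs on every frame with the latest instruction.
        actions.append(act(frame, instruction))
    return actions
```

With `period=2`, for example, the instruction computed on frame 0 is reused on frame 1, then recomputed on frame 2, so the slow component's latency does not gate the action rate.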
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
Zhang, Haoran, Guo, Hangyu, Guo, Shuyue, Cao, Meng, Huang, Wenhao, Liu, Jiaheng, Zhang, Ge
As multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks, more intricate and comprehensive benchmarks have been developed to assess these cutting-edge models. These benchmarks introduce new challenges to core capabilities such as perception, reasoning, and planning. However, existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images. To bridge this gap, we present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. ING-VP features 6 distinct games, encompassing 300 levels, each with 6 unique configurations. A single model engages in over 60,000 rounds of interaction. The benchmark framework allows for multiple comparison settings, including image-text vs. text-only inputs, single-step vs. multi-step reasoning, and with-history vs. without-history conditions, offering valuable insights into the model's capabilities. We evaluated numerous state-of-the-art MLLMs, with the highest-performing model, Claude-3.5 Sonnet, achieving an average accuracy of only 3.37%, far below the anticipated standard. This work aims to provide a specialized evaluation framework to drive advancements in MLLMs' capacity for complex spatial reasoning and planning. The code is publicly available at https://github.com/Thisisus7/ING-VP.git.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
A Quality Diversity Approach to Automatically Generate Multi-Agent Path Finding Benchmark Maps
Qian, Cheng, Zhang, Yulun, Bhatt, Varun, Fontaine, Matthew Christopher, Nikolaidis, Stefanos, Li, Jiaoyang
We use the Quality Diversity (QD) algorithm with Neural Cellular Automata (NCA) to generate benchmark maps for Multi-Agent Path Finding (MAPF) algorithms. Previously, MAPF algorithms have been tested using fixed, human-designed benchmark maps. However, such fixed benchmark maps have several problems. First, these maps may not cover all the potential failure scenarios for the algorithms. Second, when comparing different algorithms, fixed benchmark maps may introduce bias, leading to unfair comparisons. In this work, we use the QD algorithm and NCA with different objectives and diversity measures to generate maps with patterns that help us comprehensively understand the performance of MAPF algorithms and make fair comparisons between two MAPF algorithms, providing further information for choosing between them. Empirically, we employ this technique to generate diverse benchmark maps to evaluate and compare the behavior of different types of MAPF algorithms such as bounded-suboptimal algorithms, suboptimal algorithms, and reinforcement-learning-based algorithms. Through both single-planner experiments and comparisons between algorithms, we identify patterns where each algorithm excels and detect disparities in runtime or success rates between different algorithms.
- North America > United States > California (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Transportation (0.68)
- Information Technology (0.46)
Can Large Language Models Create New Knowledge for Spatial Reasoning Tasks?
Greatrix, Thomas, Whitaker, Roger, Turner, Liam, Colombo, Walter
The potential for Large Language Models (LLMs) to generate new information offers a potential step change for research and innovation. This is challenging to assert as it can be difficult to determine what an LLM has previously seen during training, making "newness" difficult to substantiate. In this paper we observe that LLMs are able to perform sophisticated reasoning on problems with a spatial dimension that they are unlikely to have previously encountered directly. While not perfect, this points to a significant level of understanding that state-of-the-art LLMs can now achieve, supporting the proposition that LLMs are able to yield significant emergent properties. In particular, Claude 3 is found to perform well in this regard.
- Asia > Middle East > Jordan (0.04)
- North America > Canada > Manitoba > Westman Region > Brandon (0.04)
Landslide Topology Uncovers Failure Movements
Rana, Kamal, Bhuyan, Kushanav, Ferrer, Joaquin Vicente, Cotton, Fabrice, Ozturk, Ugur, Catani, Filippo, Malik, Nishant
Every year, landslides cause economic damages worth 20 billion US dollars [1], and between 2004 and 2019 non-seismic landslides alone caused about 70,000 fatalities worldwide [2]. Within the first two months of 2023, we have seen reports of devastating landslides in São Paulo, Brazil [3], Southern Peru [4], and New Zealand [5], injuring many and killing approximately 70 people. Adding to this, recent studies count over one million landslide occurrences with annual volumes estimated at fifty-six billion cubic meters globally [6], presenting a risk to sixty million people [7]. With the increase in urbanization, global climate change, and environmental change trends, the frequency of landslides and the associated risks will keep increasing globally over time [7]. In line with this, landslides are anticipated to evolve and remobilize with increased frequency under changing climatic conditions on a decadal scale [8, 9]. Our ability to identify hazards from emerging landslides and dynamically assess impact areas is essential in averting risk to rapidly urbanizing communities and adapting to changing environmental conditions [10, 7]. To address the rising landslide risk, predictive models for hazard, risk, and early warning systems are developed, which assist in forecasting landslide occurrences and locating landslide-prone regions to mitigate the associated impacts [11]. However, the efficacy of these models is contingent on the quality of the underlying landslide databases.
- South America > Peru (0.24)
- Oceania > New Zealand (0.24)
- South America > Brazil > São Paulo (0.24)
- Energy (0.93)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Software > Programming Languages (0.92)
- Information Technology > Data Science (0.89)
Framework for 2D Ad placements in LinearTV
Bhargavi, Divya, Sindwani, Karan, Gholami, Sia
Virtual product placement (VPP) is the advertising technique of digitally placing a branded object into a scene of a movie or TV show. This type of advertising lets brands reach consumers without interrupting the viewing experience with a commercial break, as the products are seen in the background or as props. Although this is a billion-dollar industry, ad rendering is currently performed at the post-production stage, either manually with the help of VFX artists or through semi-automated solutions. In this paper, we demonstrate a fully automated framework to digitally place 2-D ads in linear TV cooking shows captured using a single-view camera with small camera movements. Without access to the full video or the production camera configuration, this framework performs the following tasks: (i) identifying empty space for 2-D ad placement, (ii) kitchen scene understanding, (iii) occlusion handling, (iv) ambient lighting, and (v) ad tracking.
- Media > Television (0.55)
- Media > Film (0.55)
- Media > Photography (0.34)
Emerging cooperation on the road by myopic local interactions
Rabinovich, Dmitry, Bruckstein, Alfred M.
In recent years, research in the field of autonomous vehicles has gained considerable momentum, and the idea of relieving humans of the burden of driving is starting to lose its futuristic science fiction aura. Some people believe that autonomous traffic is "our last hope" of relief from the frequent road jams we now witness even in mid-size urban areas. We envision roads of the future with fully autonomous vehicles that not only track the lane, keep a safe distance, and assist the driver, but essentially liberate humans from driving-related activities altogether.
- Asia > Middle East > Israel (0.04)
- North America > United States > California (0.04)
- Europe > United Kingdom > England (0.04)
- Asia > China > Beijing > Beijing (0.04)